Our research question asks: Between 2001 and 2021, have state-level overdose prevention policies (such as naloxone access laws and public treatment services) been associated with reductions in statewide opioid fatality rates?
The EDA is conducted across three datasets, which contain the dependent variable, statewide overdose rates (this file), and the predictor variables, naloxone access policies and treatment services.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import json
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
import plotly.express as px
import plotly.io as pio
kff_path = 'data/overdose_data.csv'
df = pd.read_csv(kff_path)
# Keep the first 52 rows (the United States aggregate, 50 states, and DC);
# trailing rows are not part of the data.
df = df.head(52)
This data was generated by KFF, a non-partisan (albeit slightly Republican-leaning) organization focused on health policy research and polling. The collected data dates back to 1999.
The data was compiled to track opioid use and its mortality impact across American states, with the intent of informing state policy, medical policy, and journalism. It is directly applicable to our research question: tracking deaths by state and year should reflect the efficacy of the harm-reduction policies implemented contemporaneously to prevent opioid-related deaths, which is one of the central goals of our research question.
A strength of this data is that it's very clean and straightforward with only three columns, simply reflecting the number of opioid-related deaths per 100,000 residents of a state across around two decades. The main dependent variable is the number of deaths per 100k residents, with the corresponding state and year as the only other features.
A potential weakness is that this data has a relatively significant proportion of null values for certain states in earlier years of recording, which may degrade the accuracy of our assessment models.
This data was collected through the Vital Statistics Cooperative Program and was provided by the 57 vital statistics jurisdictions. Data was also sourced from the publicly available Multiple Cause of Death files published by the CDC, which are strictly regulated by federal law to protect privacy and prevent misuse. Thus, there is no reason to suspect any divergence from ethical data collection standards.
"""
Cell Description:
The dataset that's read in is in a format similar to that of a pivot table.
This cell deconstructs this pivot table into the features below:
'Location', 'Year', 'Deaths_Per_100k'
"""
result = pd.melt(df,
                 id_vars='Location',
                 value_vars=list(df.columns[1:]),  # the year columns
                 var_name='Year',
                 value_name='Deaths_Per_100k')
result
| | Location | Year | Deaths_Per_100k |
|---|---|---|---|
| 0 | United States | 1999 | 2.9 |
| 1 | Alabama | 1999 | 0.8 |
| 2 | Alaska | 1999 | 4.0 |
| 3 | Arizona | 1999 | 4.7 |
| 4 | Arkansas | 1999 | 1.1 |
| ... | ... | ... | ... |
| 1191 | Virginia | 2021 | 26.0 |
| 1192 | Washington | 2021 | 20.5 |
| 1193 | West Virginia | 2021 | 77.2 |
| 1194 | Wisconsin | 2021 | 25.9 |
| 1195 | Wyoming | 2021 | 12.4 |
1196 rows × 3 columns
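As a sanity check on the reshape, the same `pd.melt` call can be run on a toy wide-format table (hypothetical values, not from the dataset) to show how each year column becomes one row per state:

```python
import pandas as pd

# Toy wide-format table mirroring the KFF layout (hypothetical values):
# one row per state, one column per year.
wide = pd.DataFrame({
    'Location': ['Alaska', 'Alabama'],
    '1999': [4.0, 0.8],
    '2000': [4.2, 1.0],
})

# Melt into long format: one (state, year) observation per row.
long = pd.melt(wide,
               id_vars='Location',
               value_vars=['1999', '2000'],
               var_name='Year',
               value_name='Deaths_Per_100k')

print(long.shape)  # (4, 3): n_states * n_years rows, three columns
```

The long format is what the plotting and grouping steps below expect: one observation per (state, year) pair.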
"""
Cell Description:
We replace data-specific labels with np.nan's.
We divide the dataset into one with and one without the aggregated 'US' Location.
We get the correct time frame for our data.
We sort for ffill later on.
"""
result['Deaths_Per_100k'] = result['Deaths_Per_100k'].replace(['NR', 'NSD'], np.nan)
result['Year'] = result['Year'].astype(int)
result['Deaths_Per_100k'] = result['Deaths_Per_100k'].astype(float)
result_with_us = result.copy()
result = result[result['Location'] != 'United States']
result = result[result['Year'] >= 2001].copy()  # copy to avoid SettingWithCopyWarning
result.sort_values(['Location', 'Year'], inplace=True)
Fortunately, there is a low percentage of null values across the dataset. North Dakota had the highest proportion of nulls, and most of the null values tend to skew towards the earlier years. As such, we decided to forward fill within each state, imputing each missing value from the nearest available data point, because death rates increase over time but tended to be relatively stable in earlier years.
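A minimal sketch of this fill strategy on toy numbers (not from the dataset): grouping by state before filling keeps imputed values from leaking across state boundaries, and a backfill covers a state whose earliest years are null.

```python
import numpy as np
import pandas as pd

# Toy panel with one leading and one trailing null (hypothetical values).
panel = pd.DataFrame({
    'Location': ['Alaska', 'Alaska', 'Alabama', 'Alabama'],
    'Year': [2001, 2002, 2001, 2002],
    'Deaths_Per_100k': [4.0, np.nan, np.nan, 1.0],
})
panel = panel.sort_values(['Location', 'Year'])

# Fill within each state: ffill carries the last observed value forward,
# bfill then covers nulls at the start of a state's series.
panel['Deaths_Per_100k'] = (
    panel.groupby('Location')['Deaths_Per_100k']
         .transform(lambda s: s.ffill().bfill())
)
print(panel['Deaths_Per_100k'].isna().sum())  # 0
```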
print('Total Number of Nulls: ', result.isna().sum().sum())
print('Percent of Nulls: ', result.isna().sum().sum()/result.shape[0])
Total Number of Nulls:  29
Percent of Nulls:  0.02707749766573296
"""
Cell Description:
We plot the number of nulls in our target variable per state.
"""
null_states = result.groupby('Location').agg(lambda x: x.isna().sum())['Deaths_Per_100k']
null_states[null_states>0].sort_values().plot.bar(title='Total Null Values per State', ylabel='count nulls')
"""
Cell Description:
We plot the number of nulls per year.
"""
null_year = result.groupby('Year').agg(lambda x: x.isna().sum())['Deaths_Per_100k']
null_year[null_year>0].plot.bar(title='Total Null Values Per Year', ylabel='count null')
"""
Cell Description:
We use forward fill to fill in null values in our target variable.
"""
result.loc[:,'Deaths_Per_100k'] = result.loc[:,'Deaths_Per_100k'].ffill()
print('Total Number of Nulls: ', result.isna().sum().sum())
Total Number of Nulls: 0
Summary Stats
result.agg({"Deaths_Per_100k": ["min", "max", "median", "skew"],})
| Deaths_Per_100k | |
|---|---|
| min | 0.700000 |
| max | 77.200000 |
| median | 7.800000 |
| skew | 2.041374 |
result['Deaths_Per_100k'].describe()
count    1071.000000
mean       10.996639
std         9.359158
min         0.700000
25%         4.950000
50%         7.800000
75%        13.550000
max        77.200000
Name: Deaths_Per_100k, dtype: float64
result.shape
(1071, 3)
From the histogram below and the summary stats, we see that the data is heavily right-skewed: the median is 7.8 deaths per 100k people while the max is 77.2, resulting in a high standard deviation of 9.36. The max is far above the 75th-percentile mark of 13.55.
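The skew statistic can be sanity-checked on synthetic right-skewed data; the lognormal draws below are purely illustrative, not the real rates. A log transform pulls the long right tail in, which can help downstream models that assume symmetric residuals.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Lognormal draws imitate a right-skewed rate distribution (illustrative only).
rates = pd.Series(rng.lognormal(mean=2.0, sigma=0.7, size=1000))

# Positive skew confirms the long right tail; after a log transform
# the distribution is roughly symmetric, so skew is much smaller.
print(rates.skew() > 0)                           # True
print(abs(np.log(rates).skew()) < rates.skew())   # True
```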
deathshist = result['Deaths_Per_100k'].hist()
plt.title("Histogram of Deaths_Per_100k")
plt.ylabel("Count")
plt.xlabel("Deaths per 100k")
plt.violinplot(result['Deaths_Per_100k'])
plt.title("Violin Plot of Deaths_Per_100k")
plt.ylabel("Deaths per 100k")
Between 2001 and 2021, nationwide overdose rates grew rapidly, increasing by roughly 150% between 2015 and 2021 alone.
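The growth figure is simple arithmetic: a 150% increase means the later rate is 2.5 times the earlier one. The numbers below are illustrative only, not taken from the dataset.

```python
def pct_growth(start: float, end: float) -> float:
    """Percent growth from start to end."""
    return (end - start) / start * 100

# Illustrative values: a rate rising from 10.0 to 25.0 is a 150% increase.
print(pct_growth(10.0, 25.0))  # 150.0
```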
nat_deaths = result_with_us[result_with_us['Location']=='United States']
fig = px.line(nat_deaths, x="Year", y="Deaths_Per_100k", title='Nationwide Deaths Per 100k')
fig.show(renderer='notebook')
Within individual states, death rates also increased rapidly in recent years. Kentucky, Tennessee, Maine, and Delaware experienced higher overdose rates at around 44 deaths per 100k, while West Virginia surpasses the rest at 77 deaths per 100k.
us_state_to_abbrev = {
"Alabama": "AL",
"Alaska": "AK",
"Arizona": "AZ",
"Arkansas": "AR",
"California": "CA",
"Colorado": "CO",
"Connecticut": "CT",
"Delaware": "DE",
"Florida": "FL",
"Georgia": "GA",
"Hawaii": "HI",
"Idaho": "ID",
"Illinois": "IL",
"Indiana": "IN",
"Iowa": "IA",
"Kansas": "KS",
"Kentucky": "KY",
"Louisiana": "LA",
"Maine": "ME",
"Maryland": "MD",
"Massachusetts": "MA",
"Michigan": "MI",
"Minnesota": "MN",
"Mississippi": "MS",
"Missouri": "MO",
"Montana": "MT",
"Nebraska": "NE",
"Nevada": "NV",
"New Hampshire": "NH",
"New Jersey": "NJ",
"New Mexico": "NM",
"New York": "NY",
"North Carolina": "NC",
"North Dakota": "ND",
"Ohio": "OH",
"Oklahoma": "OK",
"Oregon": "OR",
"Pennsylvania": "PA",
"Rhode Island": "RI",
"South Carolina": "SC",
"South Dakota": "SD",
"Tennessee": "TN",
"Texas": "TX",
"Utah": "UT",
"Vermont": "VT",
"Virginia": "VA",
"Washington": "WA",
"West Virginia": "WV",
"Wisconsin": "WI",
"Wyoming": "WY",
"District of Columbia": "DC",
}
#create df for choropleth plotting, with state names replaced with proper codes
cp_df = result.copy()
cp_df.replace({'Location':us_state_to_abbrev}, inplace=True)
cp_df
| Location | Year | Deaths_Per_100k | |
|---|---|---|---|
| 105 | AL | 2001 | 1.3 |
| 157 | AL | 2002 | 1.6 |
| 209 | AL | 2003 | 1.1 |
| 261 | AL | 2004 | 1.8 |
| 313 | AL | 2005 | 1.8 |
| ... | ... | ... | ... |
| 987 | WY | 2017 | 8.7 |
| 1039 | WY | 2018 | 6.8 |
| 1091 | WY | 2019 | 8.3 |
| 1143 | WY | 2020 | 10.6 |
| 1195 | WY | 2021 | 12.4 |
1071 rows × 3 columns
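The column-scoped `replace` above maps only values in the `Location` column, leaving the other columns untouched. A toy sketch with a two-entry subset of the mapping:

```python
import pandas as pd

# Made-up two-row frame and a subset of the us_state_to_abbrev mapping.
df = pd.DataFrame({'Location': ['Alabama', 'Wyoming'],
                   'Deaths_Per_100k': [1.3, 12.4]})
abbrev = {'Alabama': 'AL', 'Wyoming': 'WY'}

# Dict-of-dicts form: {column_name: {old_value: new_value}}.
df = df.replace({'Location': abbrev})
print(df['Location'].tolist())  # ['AL', 'WY']
```

Plotly's `locationmode='USA-states'` requires these two-letter postal codes, which is why the full-name column is converted before plotting.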
vmin = cp_df['Deaths_Per_100k'].min()
vmax = cp_df['Deaths_Per_100k'].max()  # avoid shadowing the built-ins min/max
fig = px.choropleth(cp_df,
                    locations='Location',
                    color='Deaths_Per_100k',
                    animation_frame="Year",
                    color_continuous_scale="Sunset",
                    locationmode='USA-states',
                    scope="usa",
                    range_color=(vmin, vmax),
                    title='Deaths Per 100k Over Time',
                    height=600
                    )
fig.show(renderer='notebook')
Change in Deaths per State Over Time
"""
Cell Description:
Creating Pivot Table so we can create a pct_change dataframe.
"""
result_pivot = pd.pivot_table(result,
                              values='Deaths_Per_100k',
                              index='Year',
                              columns='Location',
                              aggfunc='sum')
result_pivot = result_pivot.astype(float)
result_pivot.isna().sum().sum()
0
"""
Cell Description:
Creating pct_change() table
"""
result_pivot_chg = result_pivot.pct_change()
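`pct_change` computes the row-over-row fractional change within each column, so on the pivot it gives each state's year-over-year growth. On a toy pivot (made-up values), the first year is NaN because there is no prior year to compare against:

```python
import pandas as pd

# Minimal pivot-shaped frame: years as index, states as columns (toy values).
pivot = pd.DataFrame({'Alabama': [1.0, 1.5, 3.0],
                      'Wyoming': [2.0, 2.0, 4.0]},
                     index=[2001, 2002, 2003])

chg = pivot.pct_change()
# 1.0 -> 1.5 is a fractional change of 0.5 (i.e. +50%).
print(chg.loc[2002, 'Alabama'])  # 0.5
```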
"""
Cell Description:
Plotting pct_change() table(s)
"""
# Iterate over the state columns (not the year index, which would stop
# after the first ~24 of the 51 states), plotting ten states per figure.
chunk = 10
for start_index in range(0, result_pivot_chg.columns.size, chunk):
    end_index = start_index + chunk
    print("-------------")
    print("Start Index: ", start_index)
    print("End Index: ", end_index)
    subset_pct = result_pivot_chg.iloc[:, start_index:end_index]
    subset_pct.plot(xticks=subset_pct.index, ylabel='Pct Change', figsize=(8, 8), rot=45)
    plt.show()
# result_pivot_chg.plot(xticks=result_pivot_chg.index, ylabel='Pct Change', figsize=(8, 8), rot=45)
# plt.legend(ncol=3, fontsize=5)
# plt.show()
------------- Start Index: 0 End Index: 10
------------- Start Index: 10 End Index: 20
------------- Start Index: 20 End Index: 30
------------- Start Index: 30 End Index: 40
------------- Start Index: 40 End Index: 50
------------- Start Index: 50 End Index: 60